Basic ideas of Bayesian Model Averaging
Application of Bayesian Model Averaging using R package BAS
Estimation, Interpretation and Prediction
Advantages of Bayesian Model Averaging over all-or-none model selection
The R package BAS provides ways of carrying out BMA for
linear regression and generalized linear models. It contains
functions for plotting the BMA posterior distributions of the model
parameters, as well as an image plot function that provides a way of
visualizing the BMA output. The functions bas.lm and bas.glm provide
fast and automatic default ways of doing this for the model classes
considered.
library(BAS)
library(tidyverse)
library(broom.mixed)
To illustrate how BMA takes account of model uncertainty about the
variables to be included in linear regression, we will be using the
nhanes dataset where the variables are described in the
file nhanes-codebook.txt. Load this data with the
load function and specify the data file.
load(file='nhanes1518.rda')
We first explore the URX predictors (i.e., the ones related to
phthalate concentrations), subsetting the dataset to include
only the predictors in the weighted quantile sum module, and then filtering
out NA values:
nhanes_URX <- nhanes1518 %>%
select(BMXBMI, URXUCR, URXCNP, URXCOP, URXECP, URXHIBP, URXMBP, URXMC1, URXMEP, URXMHBP, URXMHH) %>%
na.omit()
We start exploring the data by plotting a correlation matrix
of the variables of interest. We can use the corrplot
function from the corrplot package, and adjust the
style to our desired aesthetics. We see that the predictors are
highly correlated, especially URXMHH with URXECP, and URXMHBP with
URXMBP.
library(corrplot)
corr_mat <- cor(nhanes_URX, method = "s") # Spearman rank correlations
corrplot(corr_mat,
addCoef.col = 1,
number.cex = 0.5) # Change font size of correlation coefficients
# other styles:
# default
# corrplot(corr_mat) # circles
# squares, variables presented in an alphabet order
# corrplot(corr_mat, method = 'color', order = 'alphabet') # squares
# can choose different style for lower and upper triangle, can order the variable by clustering result
#corrplot.mixed(corr_mat, lower = 'shade', upper = 'pie', order = 'hclust')
library(tidyverse)
df_long <- nhanes_URX %>% pivot_longer(everything())
# Plot histograms using ggplot
ggplot(df_long, aes(x = value)) +
geom_histogram() +
facet_wrap(~ name, scales = "free")
We also plot the distributions of the variables involved in the model, and notice that they have long tails on the original scale, which suggests the log transformation that we will perform later on.
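As a side note, the effect of the log transform on a long-tailed variable can be sketched with simulated data (a hypothetical log-normal sample, not the NHANES measurements themselves):

```r
# Hypothetical right-skewed sample, similar in shape to the URX variables
set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Sample skewness: third standardized moment
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
skewness(x)      # large and positive: a long right tail
skewness(log(x)) # near 0: roughly symmetric after the transform
```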
Motivated by the chemical mixtures setting, one way to deal with highly correlated predictors is to select variables, or to select a model. Traditionally, analysis often proceeds by first selecting the best model according to some criterion and then learning about the parameters given that the selected model is the underlying truth. This approach has potential issues: (1) we cannot quantify model uncertainty through these all-or-none selection methods, and (2) there are often many modeling choices that are secondary to the main questions of interest but can still have an important effect on conclusions. An alternative is to learn the parameters under all candidate models and then combine the estimates according to the posterior probabilities of the associated models. Bayesian Model Averaging (BMA) carries out this model combination idea: a parameter estimate (or prediction) is obtained by averaging the estimates from the different models under consideration, each weighted by its posterior model probability. The formulas below illustrate this:
Given a quantity of interest \(\Delta\), such as an effect size, a future observation, or the utility of a course of action, its posterior distribution given data \(D\) is
\[pr(\Delta|D) = \sum_{k=1}^{K} pr(\Delta|M_k, D)\, pr(M_k|D),\]
where the posterior probability of model \(M_k\) is
\[pr(M_k|D) = \frac{pr(D|M_k)pr(M_k)}{\sum_{l=1}^{K} pr(D|M_l)pr(M_l)}.\]
This is an average of the posterior distributions under each of the models considered, weighted by their posterior model probabilities.
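A small numerical sketch of the averaging step, with made-up numbers rather than fitted values: suppose a coefficient has the posterior means below under three candidate models, with the given posterior model probabilities.

```r
# Posterior mean of a coefficient under three candidate models (toy numbers)
beta_hat  <- c(M1 = 0.80, M2 = 0.75, M3 = 0.00)  # M3 excludes the variable
post_prob <- c(M1 = 0.50, M2 = 0.30, M3 = 0.20)  # pr(M_k | D), sums to 1

# BMA point estimate: model-specific estimates weighted by pr(M_k | D)
beta_bma <- sum(beta_hat * post_prob)
beta_bma  # 0.625
```

Note how the point mass at 0 from the model that excludes the variable pulls the averaged estimate toward 0; this is how BMA reflects uncertainty about whether the variable belongs in the model at all.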
There are three R packages available for BMA: BAS,
BMS, and BMA. A thorough comparison is
presented in this paper.
In this module, we will investigate this method, known as Bayesian
Model Averaging, using the BAS package.
In the formulas above, we take each \(M_k\) to be a normal linear regression model containing a subset of the candidate predictors:
\[Y = \beta_0 + \sum_{j \in M_k} \beta_j X_j + \epsilon, \qquad \epsilon \sim N(0, \sigma^2),\]
so the candidate models differ only in which predictors receive nonzero coefficients.
We fit a BMA model on the dataset using BMI as the response variable and all 10 URX variables as predictors, after log transforming every variable. As seen in the EDA, we perform the log transformation because of the long tails and extreme values on the original scale.
To get started, we will use BAS with the Zellner-Siow
Cauchy prior on the coefficients.
nhanes_URX<-log(nhanes_URX)
model <-bas.lm(BMXBMI ~ .,
data = nhanes_URX,
prior = "ZS-null",
modelprior = uniform(), initprobs = "eplogp", method="MCMC",
force.heredity = FALSE, pivot = TRUE
)
summary(model)
## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.0000000 1.0000 1.0000000 1.0000000 1.0000000
## URXUCR 0.9993164 1.0000 1.0000000 1.0000000 1.0000000
## URXCNP 0.9976563 1.0000 1.0000000 1.0000000 1.0000000
## URXCOP 0.6358887 1.0000 0.0000000 0.0000000 1.0000000
## URXECP 0.9985352 1.0000 1.0000000 1.0000000 1.0000000
## URXHIBP 0.9999512 1.0000 1.0000000 1.0000000 1.0000000
## URXMBP 0.9490723 1.0000 1.0000000 1.0000000 1.0000000
## URXMC1 0.0675293 0.0000 0.0000000 0.0000000 1.0000000
## URXMEP 0.9997559 1.0000 1.0000000 1.0000000 1.0000000
## URXMHBP 0.9994629 1.0000 1.0000000 1.0000000 1.0000000
## URXMHH 0.9109375 1.0000 1.0000000 0.0000000 1.0000000
## BF NA 1.0000 0.5747392 0.1126271 0.1144826
## PostProbs NA 0.5219 0.2790000 0.0583000 0.0533000
## R2 NA 0.2726 0.2716000 0.2703000 0.2729000
## dim NA 10.0000 9.0000000 8.0000000 11.0000000
## logmarg NA 869.0414 868.4875542 866.8577204 866.8740607
## model 5
## Intercept 1.00000000
## URXUCR 1.00000000
## URXCNP 1.00000000
## URXCOP 1.00000000
## URXECP 1.00000000
## URXHIBP 1.00000000
## URXMBP 0.00000000
## URXMC1 0.00000000
## URXMEP 1.00000000
## URXMHBP 1.00000000
## URXMHH 1.00000000
## BF 0.05824945
## PostProbs 0.02890000
## R2 0.27100000
## dim 9.00000000
## logmarg 866.19837253
In the call to bas.lm above, prior = "ZS-null" places the Zellner-Siow Cauchy prior on the coefficients; modelprior = uniform() assigns every model the same prior probability; initprobs = "eplogp" sets the initial sampling probabilities of the variables from the p-values of the full model; method = "MCMC" samples models with Markov chain Monte Carlo rather than enumerating all 2^10 of them; force.heredity = FALSE allows interactions (none here) to enter without their main effects; and pivot = TRUE uses pivoting for numerical stability when a sampled model is rank deficient.
In the summary() output, the first column, P(B != 0 | Y), gives the marginal posterior inclusion probability of each predictor, i.e. the total posterior probability of all models containing it. The remaining columns describe the five highest-probability models: a 1 or 0 indicates whether each variable is included, BF is the Bayes factor of that model against the highest-probability model, PostProbs is the estimated posterior model probability, R2 is the ordinary coefficient of determination, dim is the number of coefficients including the intercept, and logmarg is the log marginal likelihood. Together these identify the important variables: every URX predictor except URXMC1 (inclusion probability about 0.07) has inclusion probability above 0.6, and the top model, which contains all predictors except URXMC1, carries about 52% of the posterior probability.
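The BF row can be recovered from the logmarg row: the Bayes factor of each model against the top model is the exponential of the difference in log marginal likelihoods. Using the values printed above:

```r
# Log marginal likelihoods of the top five models, from summary(model)
logmarg <- c(869.0414, 868.4875542, 866.8577204, 866.8740607, 866.1983725)

# Bayes factor of each model relative to the highest-probability model
bf <- exp(logmarg - max(logmarg))
round(bf, 4)  # 1.0000 0.5747 0.1126 0.1145 0.0582
```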
In R, the image function may be used to create an image
of the model space that looks like a crossword puzzle:
image(model, rotate=F)
Here the predictors, including the intercept, are on the y-axis,
while the x-axis corresponds to the different models. Red
indicates that a variable's estimate is positive, blue
that it is negative, and beige that the variable is not
included in the model. The image plot indicates that most of the URX
variables are useful in predicting BMI; the exception is URXMC1,
which is absent from nearly all of the top models produced by BMA.
If we read the image by rows, we can see whether a variable is
included in each of the displayed models; only six models are shown
here. For example, URXUCR and URXMEP appear in all of them, while
URXMHH appears in the top two models, which have the larger posterior
probabilities, but not in the last four.
We can also plot the posterior distributions of these coefficients to take a closer look at the distributions:
coef.model <- coef(model)
# plot(coef.model, mfrow=c(3,3), ask = F)
plot(coef.model, ask = F)
This plot agrees with the summary table obtained above: the
posterior distribution of URXMC1 has a very large point mass
at 0, URXCOP a moderate one, while the distributions of
URXUCR and URXMEP have very little mass at 0. There is a
slight spike at 0 for URXUCR, indicating that
its posterior inclusion probability is not exactly 1. However,
since the probability mass at 0 is so small, we are almost
certain that URXUCR should be included under Bayesian model
averaging.
We can also look at the standard diagnostic plots for the fitted object, and check that the MCMC estimates of the posterior inclusion probabilities have converged:
plot(model, ask = F)
diagnostics(model, type = "pip", pch = 16)
plot(confint(coef.model), estimator = "HPM")
BAS has methods defined to return fitted values, fitted,
using the observed design matrix, and predictions, predict, at
either the observed data or potentially new values, as with
ordinary linear regression. The predict function also lets us
quantify the uncertainty in these predictions.
muhat.BMA <- fitted(model, estimator = "BMA")
BMA <- predict(model, estimator = "BMA")
# predict has additional slots for fitted values under BMA, predictions under each model
names(BMA)
## [1] "fit" "Ybma" "Ypred" "postprobs" "se.fit"
## [6] "se.pred" "se.bma.fit" "se.bma.pred" "df" "best"
## [11] "bestmodel" "best.vars" "estimator"
We plot the two sets of fitted values:
par(mar = c(9, 9, 3, 3))
plot(muhat.BMA, BMA$fit,
pch = 16,
xlab = expression(hat(mu[i])), ylab = expression(hat(Y[i]))
)
abline(0, 1)
We see that they are in perfect agreement, which is always the case here: the posterior mean of the regression mean function at a point \(x\) is exactly the posterior expected value of \(Y\) at \(x\).
In addition to using BMA, we can use the posterior means under model
selection. This corresponds to a decision rule that combines estimation
and selection. BAS currently implements the following
options: the highest probability model (HPM), the median probability
model (MPM), and the best predictive model (BPM).
HPM <- predict(model, estimator = "HPM")
# show the indices of variables in the best model where 0 is the intercept
HPM$bestmodel
## [1] 0 1 2 3 4 5 6 8 9 10
A more interpretable version uses the variable names:
variable.names(HPM)
## [1] "Intercept" "URXUCR" "URXCNP" "URXCOP" "URXECP" "URXHIBP"
## [7] "URXMBP" "URXMEP" "URXMHBP" "URXMHH"
MPM <- predict(model, estimator = "MPM")
variable.names(MPM)
## [1] "Intercept" "URXUCR" "URXCNP" "URXCOP" "URXECP" "URXHIBP"
## [7] "URXMBP" "URXMEP" "URXMHBP" "URXMHH"
This is the model where all predictors have an inclusion probability greater than or equal to 0.5. This coincides with the HPM if the predictors are all mutually orthogonal, and in this case is the best predictive model under squared error loss.
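The median probability model can be read directly off the first column of the summary output: keep every predictor whose posterior inclusion probability is at least 0.5. Using the values printed above:

```r
# Marginal posterior inclusion probabilities, from summary(model)
pip <- c(URXUCR = 0.9993164, URXCNP = 0.9976563, URXCOP = 0.6358887,
         URXECP = 0.9985352, URXHIBP = 0.9999512, URXMBP = 0.9490723,
         URXMC1 = 0.0675293, URXMEP = 0.9997559, URXMHBP = 0.9994629,
         URXMHH = 0.9109375)

# Median probability model: predictors with inclusion probability >= 0.5
names(pip)[pip >= 0.5]
# every predictor except URXMC1, matching variable.names(MPM)
```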
Note that we can also extract the best model from the attribute in the fitted values as well.
In general, the HPM or MPM are not the best predictive models, which from a Bayesian decision theory perspective would be the model that is closest to BMA predictions under squared error loss.
BPM <- predict(model, estimator = "BPM")
variable.names(BPM)
## [1] "Intercept" "URXUCR" "URXCNP" "URXCOP" "URXECP" "URXHIBP"
## [7] "URXMBP" "URXMEP" "URXMHBP" "URXMHH"
Now we use ggpairs to see how the predictions from these models compare:
library(GGally)
GGally::ggpairs(data.frame(
HPM = as.vector(HPM$fit), # each of these came from predict, so we extract the fitted values
MPM = as.vector(MPM$fit),
BPM = as.vector(BPM$fit),
BMA = as.vector(BMA$fit)
))
Using the se.fit = TRUE option with predict, we
can also calculate standard deviations for the predictions or for the
mean, and use these as input to the confint function for the
prediction object.
BPM <- predict(model, estimator = "BPM", se.fit = TRUE)
BPM.conf.fit <- confint(BPM, parm = "mean")
BPM.conf.pred <- confint(BPM, parm = "pred")
plot(BPM.conf.fit)
plot(BPM.conf.pred)
BAS uses a model formula similar to lm to specify the full model with all of the potential predictors. Here we are using the shorthand . to indicate that all remaining variables in the data frame supplied by the data argument should be included. Different prior distributions on the regression coefficients may be specified using the prior argument; options include "BIC", "AIC", "g-prior", "hyper-g", "EB-local", "JZS", and the "ZS-null" Zellner-Siow prior used here.